https://www.census.gov/data/datasets/time-series/demo/popest/2010s-state-total.html
https://www.iowa-demographics.com/counties_by_population
The dataset we chose for our project covers the newly discovered novel coronavirus, which causes coronavirus disease 2019 (COVID-19). With nearly all parts of life in the US and most places around the globe affected by the pandemic, it is already easily the most disruptive disease since the Spanish Flu of 1918. On top of the dangerously high number of deaths currently being predicted, the social distancing measures being implemented to slow the spread of the coronavirus appear to be pushing the economy towards a serious recession, the effects of which may still be felt years from now. Until a vaccine is widely available, it appears that this pandemic will be the most important factor affecting the livelihood of nearly every person in this country for several months, and for this reason it is a worthwhile subject to study.
Figure 1: Featured infographic
The interactive graphic above is our featured infographic, which displays the data we explored. Please use the drop-down menu on the right to explore different mappings. Some conclusions drawn from this infographic can be found under the Conclusion section.
The author of the dataset is Sudalai Rajkumar, a popular and highly rated contributor on Kaggle. Below is his LinkedIn profile, which lists some of his most impressive accomplishments. From a credibility standpoint, we are reasonably confident that Mr. Rajkumar’s data on the coronavirus is among the most accurate and thorough datasets available to the public. In addition, we checked that some of the case and testing numbers in his dataset matched the numbers reported by the CDC, which further supports the accuracy of the dataset.
https://www.linkedin.com/in/sudalairajkumar/?originalSubdomain=in
Within Mr. Rajkumar’s coronavirus Kaggle page, there are two .csv files, both updated daily. One contains coronavirus statistics at the US national level, and the other contains statistics at the US state level. We study the latter. We chose the state-level dataset because we wanted to look more closely at individual states, especially since the severity of the pandemic varies widely by location. We also know that states have put a range of different policies in place, and we hope to show a summary of the coronavirus situation in each state together with the social distancing efforts in that state.
Looking now specifically at our raw state-level .csv file, we see immediately that the dataset is arranged in a “tidy” format: each row is a single observation identified by two keys, the date of the observation and the state. Moving along the columns, there are statistics on tests performed cumulatively and on that day, as well as the number of each result (positive, negative, and pending). Additionally, we can see the number of deaths, hospitalizations, recoveries, and ventilators being used.
Finally, this state-level dataset is updated daily, meaning that every day Mr. Rajkumar adds a little over 50 rows (one for each US state and territory) that show the cumulative coronavirus statistics in that state up to that point. It is important to note that, in general, there is no accurate method to count the number of recovered patients, so the cumulative statistics offer the best way to gauge the severity of the coronavirus in each state, even though a significant number of the reported cases will have recovered by that date.
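The claim that the date and the state together identify each row can be checked directly. The following is a minimal sketch on a made-up stand-in for the file; the state codes and counts are invented, and the column names `date` and `state` are assumptions about the file's layout.

```r
# Toy stand-in for the state-level file. The state codes and counts are
# made up, and the column names "date" and "state" are assumptions.
covid <- data.frame(
  date     = as.Date(c("2020-04-29", "2020-04-29", "2020-04-30", "2020-04-30")),
  state    = c("AA", "BB", "AA", "BB"),
  positive = c(100, 10, 120, 12),
  stringsAsFactors = FALSE
)

# TRUE when no (date, state) pair appears more than once,
# i.e. the two columns together act as a key for each row.
key_is_unique <- !any(duplicated(covid[, c("date", "state")]))
print(key_is_unique)
```

If the check came back `FALSE`, some state would have two rows for the same day, and those rows would need to be reconciled before any per-state analysis.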
Some questions that we set out to explore in the analysis include the following:
Once we downloaded our raw dataset from Kaggle, we noticed that it did not include population statistics for each state, which are needed to compare states with different populations. Because of this, we joined our raw dataset with supplementary 2019 population data from the US Census. In addition, to produce the final two categorical mappings shown in our infographic, we added supplementary data on the status of each state’s stay-at-home order (as of April 30) as well as the dates on which these orders were enacted.
Cleaning and combining all of this data required careful joins in R and special attention to the data types of the columns involved. The full breakdown of how we cleaned the data is shown in approximately the first 65 lines of the Nownes.R file.
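As an illustration of the kind of join and type handling described above, here is a minimal sketch in base R. It is not the code from Nownes.R: the data frames, state codes, and values are made up, and the assumptions that dates arrive as integers such as `20200430` and that populations arrive as comma-formatted text are ours.

```r
# Hypothetical raw data: dates stored as integers (an assumption about
# the file) and made-up state codes and counts.
covid <- data.frame(
  date     = c(20200430L, 20200430L),
  state    = c("AA", "BB"),
  positive = c(500, 20),
  stringsAsFactors = FALSE
)

# Hypothetical census table with populations stored as formatted text.
census <- data.frame(
  state      = c("AA", "BB"),
  population = c("10,000", "4,000"),
  stringsAsFactors = FALSE
)

# Fix the data types before joining.
covid$date        <- as.Date(as.character(covid$date), format = "%Y%m%d")
census$population <- as.numeric(gsub(",", "", census$population))

# Left join: keep every row of the coronavirus data and attach population.
joined <- merge(covid, census, by = "state", all.x = TRUE)
print(joined)
```

Using `all.x = TRUE` keeps states that are missing from the census table (with `NA` population) rather than silently dropping them, which makes mismatched state names easier to spot.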
Figure 2: Cases/1000 People
We began our exploratory analysis by creating the above figure, showing the number of cases per 1000 people in each state. We see immediately that New York and its surrounding states have the most severe outbreaks of coronavirus at this time.
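The per-capita rate behind this figure is the cumulative positive count divided by the state's population, scaled to 1000 people. A minimal sketch follows; the function name `cases_per_1000` and the example numbers are hypothetical.

```r
# Cases per 1000 people: scale the positive count by population.
# The function name is hypothetical, chosen for this illustration.
cases_per_1000 <- function(positive, population) {
  1000 * positive / population
}

# Made-up example: 500 cases in a population of 10,000.
rate <- cases_per_1000(500, 10000)
print(rate)
```

Because the function is vectorized, it can be applied to whole columns at once, for example to a `positive` and a `population` column of a joined data frame.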
Figure 3: Cases Per Day
Our exploratory analysis covers several different aspects of the data. After cleaning the dataset, our first task was to look at the overall trend. Using ggplot, we produced a bar graph of the new reported cases per day in the United States over time (Figure 3). The US has been reporting cases since January 22nd, and the graph shows extremely low growth for the first two months. Around the middle of March there is a significant jump as cases increase. After that, through April, the daily counts occasionally fluctuate slightly. Overall, we concluded that the number of new cases per day rose rapidly and that, as of right now, there is no sign of a long-term decrease.
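Since the dataset stores cumulative totals per state, a daily national series has to be derived before it can be plotted. The sketch below shows one way to do that in base R, under the assumption that the data has `date`, `state`, and cumulative `positive` columns; the values are made up, and the plotting step only runs if ggplot2 is installed.

```r
# Made-up cumulative counts for two hypothetical states over three days.
covid <- data.frame(
  date     = as.Date(rep(c("2020-03-01", "2020-03-02", "2020-03-03"), each = 2)),
  state    = rep(c("AA", "BB"), times = 3),
  positive = c(10, 1, 15, 2, 30, 2),
  stringsAsFactors = FALSE
)

# Sum the cumulative counts across states for each date.
national <- aggregate(positive ~ date, data = covid, FUN = sum)
national <- national[order(national$date), ]

# New cases per day are the day-to-day differences of the cumulative
# total; the first day has no previous day, so it is NA.
national$new_cases <- c(NA, diff(national$positive))
print(national)

# Optional bar graph of new cases per day (skipped without ggplot2).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(national[-1, ],
                       ggplot2::aes(x = date, y = new_cases)) +
    ggplot2::geom_col()
}
```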
Figure 4: Tests Per Day
Next, we wanted to look at how the number of tests conducted relates to this overall trend. Again using ggplot, we produced the bar graph above, which shows the number of tests conducted per day in the United States (Figure 4). Comparing the two graphs, we can see that the number of new reported cases per day increases along with the number of tests performed per day; both graphs show a similar day-by-day increase. One caveat is that our data cannot fully reflect the true trend of the coronavirus in the US, because there are most likely individuals who had the virus but were never tested. This could also explain why there is little to no data for the first two months, and it suggests that the pandemic might have started earlier than March.
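One rough way to put a number on the visual similarity between the two graphs is a correlation coefficient, together with the share of tests that come back positive. This is a sketch of that idea rather than a step from our analysis; the daily values and the column names `new_tests` and `new_cases` are made up.

```r
# Made-up daily national series of tests performed and new cases.
daily <- data.frame(
  new_tests = c(100, 200, 400, 800),
  new_cases = c(10, 25, 35, 90)
)

# Pearson correlation between tests per day and new cases per day.
r <- cor(daily$new_tests, daily$new_cases)

# Share of each day's tests that were positive. If this share stays
# roughly flat while testing grows, rising case counts may partly
# reflect more testing rather than faster spread.
daily$positive_share <- daily$new_cases / daily$new_tests
print(r)
print(daily)
```

A high correlation on its own does not separate the effect of increased testing from a genuine rise in infections, which is the caveat raised above.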
Figure 5: Testing Over Time
Figure 6: New York
Figure 7: Alaska
Figure 8: Wyoming
Next, we wanted to look at the increase in the number of people tested in each state over time. Using ggplot, we produced a graph that displays the testing trajectory for each state. The most notable findings were that New York had the largest increase in the number of people tested, while Wyoming had the smallest. Other states with a high number of people tested were California, Florida, and Texas, which makes sense given the populations of these states.
We then wanted to find where in the US the fewest and the most cases had been reported. Using plot_ly, we produced interactive graphs that display the number of tests given, the number of positive cases, and the number of deaths for New York and Alaska as each day passed. These two states were identified by applying the max and min functions to the “positive” column for May 1st, 2020, which gave the maximum and minimum number of reported positive cases. The increasing trend we established earlier was still present in these graphs; however, we could also see a linear trend, as well as New York beginning to flatten its curve of positive cases, which is a good sign.
For the final part of this comparison, we wanted to know why Wyoming did not have the fewest reported cases despite having the smallest increase in people tested. To find out, we compared Alaska and Wyoming. Both states have relatively small populations (below 1 million); however, Alaska had 195 fewer positive cases than Wyoming despite having tested more than twice as many people. One major difference that could explain this is whether or not a stay-at-home order was issued: Alaska issued its order on March 25th, 2020, while Wyoming has yet to enact one. This points to the impact that staying at home can have on the number of cases in a state.
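The lookup of the states with the most and the fewest reported cases on a given date can be sketched as follows. This is an illustration rather than the code from our script: the state codes and counts are made up, and it uses `which.max` and `which.min` (the positions of the maximum and minimum) so that the matching state can be read off directly.

```r
# Made-up cumulative positive counts for three hypothetical states.
covid <- data.frame(
  date     = as.Date(rep(c("2020-04-30", "2020-05-01"), each = 3)),
  state    = rep(c("AA", "BB", "CC"), times = 2),
  positive = c(900, 40, 30, 1000, 45, 32),
  stringsAsFactors = FALSE
)

# Keep only the rows for the date of interest.
snapshot <- covid[covid$date == as.Date("2020-05-01"), ]

# Row positions of the largest and smallest positive counts give the
# states with the most and the fewest reported cases on that date.
most_cases  <- snapshot$state[which.max(snapshot$positive)]
least_cases <- snapshot$state[which.min(snapshot$positive)]
print(c(most = most_cases, least = least_cases))
```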
Figure 9: Deaths Per Day
Figure 10: Deaths Per Day
Social distancing data:
https://www.kff.org/health-costs/issue-brief/state-data-and-policy-actions-to-address-coronavirus/
https://www.littler.com/publication-press/publication/stay-top-stay-home-list-statewide